ftp.cs.arizona.edu

Internet Message Format | 2000-09-20 | 6KB

Return-Path: <icon-group-sender>
Received: from kingfisher.CS.Arizona.EDU (kingfisher.CS.Arizona.EDU [192.12.69.239])
	by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) with SMTP id MAA26222
	for <icon-group-addresses@baskerville.CS.Arizona.EDU>; Thu, 10 Sep 1998 12:24:31 -0700 (MST)
Received: by kingfisher.CS.Arizona.EDU (5.65v4.0/1.1.8.2/08Nov94-0446PM)
	id AA31099; Thu, 10 Sep 1998 12:24:04 -0700
From: gep2@computek.net
Date: Thu, 10 Sep 1998 13:46:57 -0500 (CDT)
Message-Id: <199809101846.NAA18441@mail.cmpu.net>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Subject: Unicode support or support for non-Ascii based character manipulation?
To: icon-group@optima.CS.Arizona.EDU
X-Mailer: SPRY Mail Version: 04.00.06.17
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO

> Icon has been a very interesting language for string manipulation,

Certainly! If not the MOST interesting language for such purposes.

> however, the limit of supporting only ASCII

Actually, that's not really true. Icon is much more free of "supporting ONLY ASCII" than C, for example. (I don't know how true this is about things like conversions... does Icon automatically support EBCDIC character assignments, for example, if generated on an EBCDIC system?)

Certainly though there are issues that come up with supporting international characters, and in part that's due to the fact that there doesn't seem to be any real international agreement on how (at least some) other natural languages map their alphabetic characters into (at least) an 8-bit byte. Hebrew is one example, where there seem to be at least three different competing "standards" for where the characters are mapped.

> makes it less useful for non-English language work.

Well, I agree that it's perhaps less useful than it MIGHT be, but I still suspect it's (far!) more useful than OTHER programming languages are for these kinds of things.

> With the computer industry heading towards Unicode support,

Okay, I don't dispute that this move is happening but personally I still don't very much like it. The fact is that (at least here in the Western Hemisphere, where probably most of the world's computers are used) an eight-bit byte is already quite sufficient for most purposes, and doubling it comes at a cost in complexity and storage (RAM, disk, tape, whatever) which is simply very, very hard to justify on any genuine economic basis.

If other countries have more difficult (or huge) character sets, that is (while a fact of life) simply an inherent disadvantage of their culture (and note that I'm not intending that as a slam or value judgement, it just IS the way it is), and I don't see a terribly convincing argument why the other countries (without that disadvantage) ought to pay the price too, just in order to artificially level the playing field.

> ...it should be possible to begin including support for non-English and non alphabetic languages.

I think that a lot of the basic manipulations and features in Icon (tables, sets, etc) are probably insensitive to the character mapping used. And Icon does seem to be pretty much (totally?) eight-bit clean (unlike C), which at least gives one the ability to construct stuff on top of it to support other languages.

One issue, of course, is the one I mentioned earlier... conversions, although numeric formatting is one other specific example of a potential problem area. Certainly not all cultures prefer Arabic numerals.

Another issue, perhaps unique to Icon, is the implementation of "character set" datatypes, which I'd suspect would end up being quite different for a language containing 65,536 distinct characters... since the character set data representation, presumably unless a different implementation technique were used, would be not twice but 256 times larger than for an eight-bit character set.

I can certainly understand and appreciate the problems that the huge character sets used in some eastern countries have played for them, and frankly have been surprised by the extent to which solutions for things like keyboards have been mastered. And text processing with such large character sets certainly must represent a whole series of unique challenges, so I can understand the interest in those countries in something like Icon for attacking them.

> Has anyone thought about this yet? What does string and pattern matching mean in, for example, Japanese?

I have given the matter some thought, although just as an 'outside observer'. I would presume that a "full/nice" implementation for such languages would result in simply processing Unicode-like 16-bit characters, with everything that involves. At *some* point, barring having complete 16-bit-byte uniformity across everything from CPUs and operating systems to peripheral devices, there might have to be some conversions and "glue" interface work done, and classically it's at those border/edge regions that the seams tend to be less than pretty.

Certainly one of the more interesting Icon-related issues I've seen come up here in a while. I seem to recall it was mentioned briefly some time ago (perhaps that was on the SNOBOL4 list instead?) but didn't go very far at the time.

Gordon Peterson
http://www.computek.net/public/gep2/
Support the Anti-SPAM Amendment! Join at http://www.cauce.org/
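
The "not twice but 256 times larger" figure for character sets in the message above can be checked with a little arithmetic. The following is only a rough sketch in Python (not Icon), and it assumes a character set is stored as a flat bitmap with one bit per possible character; the message itself only presumes such a representation. The function name `cset_bitmap_bytes` is made up for illustration.

```python
def cset_bitmap_bytes(bits_per_char: int) -> int:
    """Bytes needed for a membership bitmap with one bit per possible character.

    Assumption: a flat bit vector, one bit for every code the character
    width can express. Other representations would give other sizes.
    """
    distinct_chars = 2 ** bits_per_char
    return distinct_chars // 8  # 8 bits per byte


eight_bit = cset_bitmap_bytes(8)     # 256 possible characters
sixteen_bit = cset_bitmap_bytes(16)  # 65,536 possible characters
ratio = sixteen_bit // eight_bit

print(eight_bit, sixteen_bit, ratio)  # prints: 32 8192 256
```

Under that assumption, widening the character from 8 to 16 bits doubles the size of each character in a string but multiplies the size of every character set by 256 (32 bytes versus 8,192 bytes), which is why a different implementation technique might be needed.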